Exploratory Data Analysis¶

The main objectives for this notebook are:

  • Explore the clean dataset by performing univariate analysis
  • Investigate the relationships between the features and the target by performing bivariate and multivariate analyses
  • Extract relevant insights to share with business stakeholders
  • Understand steps that will be required for ML pre-processing

Notes¶

  1. Use the Polars framework instead of pandas
  2. Use interactive plots (e.g. Plotly) for visualisations
  3. Write clear insights after every section of the analysis
  4. Use well-written and documented utility functions

Imports¶

In [1]:
%load_ext autoreload
%autoreload 2
In [2]:
import sys, os
import plotly.io as pio

# Add the sibling utils/ folder to the import path so its modules can be imported
# (os.path.dirname('__file__') in the original resolved to '', i.e. the cwd)
path2add = os.path.normpath(os.path.abspath(os.path.join(os.getcwd(), os.path.pardir, 'utils')))
if path2add not in sys.path:
    sys.path.append(path2add)
    
import polars as pl
import plotly.express as px
from visualisations import bar_plot, proportion_plot, boxplot_by_bin_with_target
# etc

pio.renderers.default='notebook'
In [3]:
import pandas as pd
data = pd.read_parquet("../data/supervised_clean_data.parquet")
In [4]:
data.head()
Out[4]:
_id inter_api_access_duration(sec) api_access_uniqueness sequence_length(count) vsession_duration(min) ip_type num_sessions num_users num_unique_apis source classification is_anomaly
0 0 1f2c32d8-2d6e-3b68-bc46-789469f2b71e 0.000812 0.004066 85.643243 5405 default 1460.0 1295.0 451.0 E normal False
1 1 4c486414-d4f5-33f6-b485-24a8ed2925e8 0.000063 0.002211 16.166805 519 default 9299.0 8447.0 302.0 E normal False
2 2 7e5838fc-bce1-371f-a3ac-d8a0b2a05d9a 0.004481 0.015324 99.573276 6211 default 255.0 232.0 354.0 E normal False
3 3 82661ecd-d87f-3dff-855e-378f7cb6d912 0.017837 0.014974 69.792793 8292 default 195.0 111.0 116.0 E normal False
4 4 d62d56ea-775e-328c-8b08-db7ad7f834e5 0.000797 0.006056 14.952756 182 default 272.0 254.0 23.0 E normal False
In [5]:
data.shape
Out[5]:
(1695, 13)
In [6]:
data.info
Out[6]:
<bound method DataFrame.info of                                              _id  \
0        0  1f2c32d8-2d6e-3b68-bc46-789469f2b71e   
1        1  4c486414-d4f5-33f6-b485-24a8ed2925e8   
2        2  7e5838fc-bce1-371f-a3ac-d8a0b2a05d9a   
3        3  82661ecd-d87f-3dff-855e-378f7cb6d912   
4        4  d62d56ea-775e-328c-8b08-db7ad7f834e5   
...    ...                                   ...   
1690  1694  3653d165-4b93-346b-9543-f1d4f5bf4831   
1691  1695  44356d09-52e9-321e-9ec1-630e582bfe53   
1692  1696  0ecdc692-df55-3990-815e-a30f1ee63f5f   
1693  1697  468a84b3-2885-30d6-b1a8-6cf2e44577cd   
1694  1698  2854b436-7d8b-3f2c-8139-3340ad2cd45a   

      inter_api_access_duration(sec)  api_access_uniqueness  \
0                           0.000812               0.004066   
1                           0.000063               0.002211   
2                           0.004481               0.015324   
3                           0.017837               0.014974   
4                           0.000797               0.006056   
...                              ...                    ...   
1690                       45.603433               0.800000   
1691                      852.929250               0.500000   
1692                       59.243000               0.800000   
1693                        0.754000               0.666667   
1694                       66.934857               0.428571   

      sequence_length(count)  vsession_duration(min)     ip_type  \
0                  85.643243                    5405     default   
1                  16.166805                     519     default   
2                  99.573276                    6211     default   
3                  69.792793                    8292     default   
4                  14.952756                     182     default   
...                      ...                     ...         ...   
1690               15.000000                   41044  datacenter   
1691                2.000000                  102352  datacenter   
1692                5.000000                   17773  datacenter   
1693                3.000000                     136  datacenter   
1694                7.000000                   28113  datacenter   

      num_sessions  num_users  num_unique_apis source classification  \
0           1460.0     1295.0            451.0      E         normal   
1           9299.0     8447.0            302.0      E         normal   
2            255.0      232.0            354.0      E         normal   
3            195.0      111.0            116.0      E         normal   
4            272.0      254.0             23.0      E         normal   
...            ...        ...              ...    ...            ...   
1690           2.0        1.0             12.0      F        outlier   
1691           2.0        1.0              1.0      F        outlier   
1692           3.0        1.0              4.0      F        outlier   
1693           2.0        1.0              2.0      F        outlier   
1694           3.0        1.0              3.0      F        outlier   

      is_anomaly  
0          False  
1          False  
2          False  
3          False  
4          False  
...          ...  
1690        True  
1691        True  
1692        True  
1693        True  
1694        True  

[1695 rows x 13 columns]>
In [7]:
data.isna().sum()
Out[7]:
                                  0
_id                               0
inter_api_access_duration(sec)    0
api_access_uniqueness             0
sequence_length(count)            0
vsession_duration(min)            0
ip_type                           0
num_sessions                      0
num_users                         0
num_unique_apis                   0
source                            0
classification                    0
is_anomaly                        0
dtype: int64
In [8]:
data = pl.read_parquet("../data/supervised_clean_data.parquet")
print(data.shape)
data.head()
(1695, 13)
Out[8]:
shape: (5, 13)
 | _id | inter_api_access_duration(sec) | api_access_uniqueness | sequence_length(count) | vsession_duration(min) | ip_type | num_sessions | num_users | num_unique_apis | source | classification | is_anomaly
i64 | str | f64 | f64 | f64 | i64 | str | f64 | f64 | f64 | str | str | bool
0 | "1f2c32d8-2d6e-3b68-bc46-789469… | 0.000812 | 0.004066 | 85.643243 | 5405 | "default" | 1460.0 | 1295.0 | 451.0 | "E" | "normal" | false
1 | "4c486414-d4f5-33f6-b485-24a8ed… | 0.000063 | 0.002211 | 16.166805 | 519 | "default" | 9299.0 | 8447.0 | 302.0 | "E" | "normal" | false
2 | "7e5838fc-bce1-371f-a3ac-d8a0b2… | 0.004481 | 0.015324 | 99.573276 | 6211 | "default" | 255.0 | 232.0 | 354.0 | "E" | "normal" | false
3 | "82661ecd-d87f-3dff-855e-378f7c… | 0.017837 | 0.014974 | 69.792793 | 8292 | "default" | 195.0 | 111.0 | 116.0 | "E" | "normal" | false
4 | "d62d56ea-775e-328c-8b08-db7ad7… | 0.000797 | 0.006056 | 14.952756 | 182 | "default" | 272.0 | 254.0 | 23.0 | "E" | "normal" | false
In [9]:
data['ip_type'].unique()
Out[9]:
shape: (2,)
ip_type
str
"default"
"datacenter"

Univariate Analysis¶

This section goes through the available columns and plots their distributions to surface outliers and get familiar with the dataset.

In [10]:
bar_plot(data, "ip_type", "IP Type Counts")

Observations:

  • There are just two ip types - default and datacenter, with default being the most frequent one

Features vs Target¶

This section performs a bi-variate analysis by looking at the distributions of normal vs outliers. This can help in determining what data and feature selection to perform.

In [11]:
proportion_plot(data, "ip_type", "is_anomaly", "Behaviour Type by IP Type")

Observations:

  • If the activity comes from a datacenter, it is guaranteed to be an outlier

Impact

  • The dataset needs to be filtered to include only default traffic since we don't need a model to classify datacenter traffic

Hypotheses¶

Are longer sessions with high speed inter API calls more anomalous?¶

It's usually the case that a lot of events happening in a short period of time signals bot or other malicious activity. Let's see whether that holds for this dataset.

In [12]:
boxplot_by_bin_with_target(
    data = data,
    column_to_bin = "sequence_length(count)",
    numeric_column = "inter_api_access_duration(sec)",
    target = "is_anomaly"
)

Observations

  • Outliers have shorter inter API access durations (i.e. faster calls) than normal traffic

Insights

  • Longer sequences with faster inter API access duration are not more likely to be anomalous

Summary¶

Main Insights¶

  • Most of the traffic comes from the default IP type; only 9% comes from datacenters
  • All the datacenter traffic is considered to be anomalous
  • Longer sequences with faster inter API access durations are not more likely to be anomalous

Implications for Modelling¶

  • The dataset needs to be filtered to include only the default IP type